
Character confusion versus focus word-based correction of spelling and OCR variants in corpora

Internal identifier: 000572 (Main/Exploration); previous: 000571; next: 000573

Authors: Martin W. C. Reynaert [Netherlands]

Source:

International Journal on Document Analysis and Recognition (Print) [ISSN 1433-2833]; 2011.

RBID : Pascal:11-0343814

French descriptors

Pascal (Inist)
- Reconnaissance caractère, Reconnaissance optique caractère, Hachage, Texte, Chaîne caractère, Langage naturel, Typographie, Distance, Analyse statistique, Appariement chaîne.

English descriptors

KwdEn :
- Character recognition, Character string, Distance, Hashing, Natural language, Optical character recognition, Statistical analysis, String matching, Text, Typography.

Abstract

We present a new approach based on anagram hashing to handle globally the lexical variation in large and noisy text collections. Lexical variation addressed by spelling correction systems is primarily typographical variation. This is typically handled in a local fashion: given one particular text string some system of retrieving near-neighbors is applied, where near-neighbors are other text strings that differ from the particular string by a given number of characters. The difference in characters between the original string and one of its retrieved near-neighbors constitutes a particular character confusion. We present a global way of performing this action: for all possible particular character confusions given a particular edit distance, we sequentially identify all the pairs of text strings in the text collection that display a particular confusion. We work on large digitized corpora, which contain lexical variation due to both the OCR process and typographical or typesetting error and show that all these types of variation can be handled equally well in the framework we present. The character confusion-based prototype of Text-Induced Corpus Clean-up (TICCL) is compared to its focus word-based counterpart and evaluated on 6 years' worth of digitized Dutch Parliamentary documents. The character confusion approach is shown to gain an order of magnitude in speed on its word-based counterpart on large corpora. Insights gained about the useful contribution of global corpus variation statistics are shown to also benefit the more traditional word-based approach to spelling correction. Final tests on a held-out set comprising the 1918 edition of the Dutch daily newspaper 'Het Volk' show that the system is not sensitive to domain variation.
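
The global, confusion-driven search described in the abstract can be illustrated with a minimal sketch. This is not the TICCL implementation. It assumes an anagram key computed as the sum of character code points raised to a fixed power (the exponent 5, the function names, and the toy vocabulary are all assumptions made for illustration, since the abstract does not specify the hash function). Every single-character confusion then corresponds to a numeric difference between keys, and the outer loop runs over confusions rather than over focus words.

```python
from collections import defaultdict

POWER = 5  # assumed exponent; the abstract does not specify the hash function


def anagram_key(text):
    """Sum of code points raised to POWER; identical for all anagrams of text."""
    return sum(ord(ch) ** POWER for ch in text)


def levenshtein(a, b):
    """Plain dynamic-programming edit distance, used to verify candidates."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def confusion_pairs(vocabulary, alphabet, max_distance=1):
    """Find variant pairs globally, one character confusion at a time."""
    index = defaultdict(set)
    for word in vocabulary:
        index[anagram_key(word)].add(word)

    # Confusion values for single-character edits: inserting or deleting one
    # character, or substituting one character for another.
    values = {anagram_key(c) for c in alphabet}
    values |= {abs(anagram_key(x) - anagram_key(y))
               for x in alphabet for y in alphabet if x != y}

    pairs = set()
    for value in values:                      # outer loop: confusions
        for key, words in index.items():      # inner loop: the whole vocabulary
            for other in index.get(key + value, ()):
                for word in words:
                    # Keys ignore character order, so confirm the real distance.
                    if levenshtein(word, other) <= max_distance:
                        pairs.add(tuple(sorted((word, other))))
    return pairs


if __name__ == "__main__":
    vocab = ["parliament", "parlianent", "volk", "vo1k", "kamer"]
    print(confusion_pairs(vocab, set("".join(vocab))))
```

Because the key ignores character order, two strings whose keys differ by a confusion value are only candidates. The edit-distance check filters out pairs that share the same bag of characters but differ in more positions than allowed.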

Affiliations:

Netherlands

Links to previous steps (curation, corpus...)

to stream PascalFrancis, to step Corpus: 000123
to stream PascalFrancis, to step Curation: 000650
to stream PascalFrancis, to step Checkpoint: 000126
to stream Main, to step Merge: 000578
to stream Main, to step Curation: 000572

The document in XML format

<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en" level="a">Character confusion versus focus word-based correction of spelling and OCR variants in corpora</title>
<author><name sortKey="Reynaert, Martin W C" sort="Reynaert, Martin W C" uniqKey="Reynaert M" first="Martin W. C." last="Reynaert">Martin W. C. Reynaert</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>Tilburg Centre for Cognition and Communication, Tilburg University, Kamer D 342, P.O. Box 90153</s1>
<s2>5000 LE, Tilburg</s2>
<s3>NLD</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
<country>Pays-Bas</country>
<wicri:noRegion>5000 LE, Tilburg</wicri:noRegion>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">INIST</idno>
<idno type="inist">11-0343814</idno>
<date when="2011">2011</date>
<idno type="stanalyst">PASCAL 11-0343814 INIST</idno>
<idno type="RBID">Pascal:11-0343814</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000123</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000650</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000126</idno>
<idno type="wicri:doubleKey">1433-2833:2011:Reynaert M:character:confusion:versus</idno>
<idno type="wicri:Area/Main/Merge">000578</idno>
<idno type="wicri:Area/Main/Curation">000572</idno>
<idno type="wicri:Area/Main/Exploration">000572</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a">Character confusion versus focus word-based correction of spelling and OCR variants in corpora</title>
<author><name sortKey="Reynaert, Martin W C" sort="Reynaert, Martin W C" uniqKey="Reynaert M" first="Martin W. C." last="Reynaert">Martin W. C. Reynaert</name>
<affiliation wicri:level="1"><inist:fA14 i1="01"><s1>Tilburg Centre for Cognition and Communication, Tilburg University, Kamer D 342, P.O. Box 90153</s1>
<s2>5000 LE, Tilburg</s2>
<s3>NLD</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
<country>Pays-Bas</country>
<wicri:noRegion>5000 LE, Tilburg</wicri:noRegion>
</affiliation>
</author>
</analytic>
<series><title level="j" type="main">International journal on document analysis and recognition : (Print)</title>
<title level="j" type="abbreviated">Int. j. doc. anal. recognit. : (Print)</title>
<idno type="ISSN">1433-2833</idno>
<imprint><date when="2011">2011</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt><title level="j" type="main">International journal on document analysis and recognition : (Print)</title>
<title level="j" type="abbreviated">Int. j. doc. anal. recognit. : (Print)</title>
<idno type="ISSN">1433-2833</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Character recognition</term>
<term>Character string</term>
<term>Distance</term>
<term>Hashing</term>
<term>Natural language</term>
<term>Optical character recognition</term>
<term>Statistical analysis</term>
<term>String matching</term>
<term>Text</term>
<term>Typography</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Reconnaissance caractère</term>
<term>Reconnaissance optique caractère</term>
<term>Hachage</term>
<term>Texte</term>
<term>Chaîne caractère</term>
<term>Langage naturel</term>
<term>Typographie</term>
<term>Distance</term>
<term>Analyse statistique</term>
<term>Appariement chaîne</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">We present a new approach based on anagram hashing to handle globally the lexical variation in large and noisy text collections. Lexical variation addressed by spelling correction systems is primarily typographical variation. This is typically handled in a local fashion: given one particular text string some system of retrieving near-neighbors is applied, where near-neighbors are other text strings that differ from the particular string by a given number of characters. The difference in characters between the original string and one of its retrieved near-neighbors constitutes a particular character confusion. We present a global way of performing this action: for all possible particular character confusions given a particular edit distance, we sequentially identify all the pairs of text strings in the text collection that display a particular confusion. We work on large digitized corpora, which contain lexical variation due to both the OCR process and typographical or typesetting error and show that all these types of variation can be handled equally well in the framework we present. The character confusion-based prototype of Text-Induced Corpus Clean-up (TICCL) is compared to its focus word-based counterpart and evaluated on 6 years' worth of digitized Dutch Parliamentary documents. The character confusion approach is shown to gain an order of magnitude in speed on its word-based counterpart on large corpora. Insights gained about the useful contribution of global corpus variation statistics are shown to also benefit the more traditional word-based approach to spelling correction. Final tests on a held-out set comprising the 1918 edition of the Dutch daily newspaper 'Het Volk' show that the system is not sensitive to domain variation.</div>
</front>
</TEI>
<affiliations><list><country><li>Pays-Bas</li>
</country>
</list>
<tree><country name="Pays-Bas"><noRegion><name sortKey="Reynaert, Martin W C" sort="Reynaert, Martin W C" uniqKey="Reynaert M" first="Martin W. C." last="Reynaert">Martin W. C. Reynaert</name>
</noRegion>
</country>
</tree>
</affiliations>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000572 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000572 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     Pascal:11-0343814
   |texte=   Character confusion versus focus word-based correction of spelling and OCR variants in corpora
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024

	OCR exploration server
	Warning: this site is under development. It was generated automatically from raw corpora, so the information has not been validated.
